Thursday, August 03, 2017


The State of Open Access: Some New Data 


A preprint posted on PeerJ yesterday offers some new insight into the 
number of articles now available on an open-access basis. 


The new study is different to previous ones in a number of ways, not least 
because it includes data from users of Unpaywall, a browser plug-in that 
identifies papers that researchers are looking for, and then checks to see 
whether the papers are available for free anywhere on the Web. 


Unpaywall is based on oaDOI, a tool that scours the web for open-access
full-text versions of journal articles.


Both tools were developed by Impactstory, a non-profit focused on
open-access issues in science. Two of the authors of the PeerJ preprint,
Heather Piwowar and Jason Priem, founded Impactstory. They also wrote the
Unpaywall and oaDOI software.


The paper — which is called The State of 
OA: A large-scale analysis of the prevalence and impact of Open Access 
articles — reports that 28% of the scholarly literature (19 million articles) is 
now OA, and growing, and that for recent articles the percentage available 
as OA rises to 45%. 


The study authors say they also found that OA articles receive 18% more 
citations than average. 


In addition, the authors report on what they describe as a previously under- 
discussed phenomenon of open access — Bronze OA. This refers to articles 
that are made free-to-read on the publisher’s website without an explicit
open licence. 


Below I publish a Q&A with Heather Piwowar about the study. 


Note: my questions were based on an earlier version of the article I saw, and a couple of the quotes I cite were changed in 
the final version of the paper. Nevertheless, all the questions and the answers remain relevant and useful so I have not 
changed any of the questions. 


The interview 


RP: What is new and different about your study? Do you feel it is more 
accurate than previous studies that have sought to estimate how much of 
the literature is OA, or is it just another shot at trying to do that? 


HP: Our study has a few important differences: 


We look at a broader range of the literature than previous studies and go 
further back (to pre-1950 articles), we look at more articles (all of Crossref, 
not just all of Scopus or Web of Science — Crossref has twice the number of 
articles that Scopus has), and we take a larger sample than most other 
studies. That’s because we classify OA status algorithmically, rather than 
relying on manual classification. This allowed us to sample 300k articles, 
rather than a few hundred as many OA studies have done. So, our sample is 
more accurate than most; and more generalizable as well. 


We undertook a more detailed categorization of OA. We looked not just at 
Green and Gold OA, but also Hybrid, and a new category we call Bronze 
OA. Many other studies (including the most comparable to ours, the 
European Commission report you mention below) do not bring out all these 
categories specifically. (I will say more on that below). Furthermore, we 
didn’t include Academic Social Networks. Mixing those with publisher-
hosted free-to-read content makes the results less useful to policy makers. 


Our data and our methods are open, for anyone to use and build upon. 
Again, this is a big difference from the Archambault et al. study (that is, the 
one commissioned by the European Commission) and we think it is an 
important difference. 


We include data from Unpaywall users, which allows us to get a sense of 
how much of the literature is OA from the perspective of actual readers. 
Readers massively favour newer articles, for instance, which is good news 
because such articles are more likely to be OA. By sampling actual reader 
data, from people using an OA tool that anyone can install, we can report 
OA percentages that are more realistic and useful for many real-world 
policy issues. 


RP: You estimate that at least 28% of the scholarly literature is open 
access today. OA advocates tend nowadays to cite the earlier European 
Commission report which, the EU claims, indicates that back in 2011 
nearly 50% of papers were OA. Was the EU study an overestimate in your 
view, or has there been a step backwards? 


HP: Their 50% estimate was of recent papers, and included papers posted 
to ResearchGate (RG) and Academia.edu as open access. Our 28% estimate 
is for all journal articles, going back to 1900 — everything with a DOI. We 
found 45% OA for recent articles, and that’s excluding RG and Academia. 
So, they are pretty similar estimates. 


RP: In fact, you came up with a number of different percentages. Can you 
explain the differences between these figures, why it is important to make 
these distinctions, and what the implications of the different figures are? 


HP: There are two summary percentages: 28% OA for all journal articles, 
and 47% OA for journal articles that people read. As I noted, people read 
more recent articles, and more recent articles are more likely to be OA, so it 
turns out that almost half of the papers people are interested in reading right 
now are actually OA. Which is really cool! 


Actually, when you consider that we used automated methods that missed a 
bit of OA it is more than half, so the 47% is a lower bound. 
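

One way to see why the two summary figures differ is to treat each as a weighted average of the same per-age OA rates. The sketch below is illustrative only: the age bands, shares and rates are made-up numbers chosen to show the effect, not data from the study, and the names (AgeBand, weighted_oa_rate) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeBand:
    label: str
    corpus_share: float  # fraction of all articles that fall in this band
    reader_share: float  # fraction of reader requests that fall in this band
    oa_rate: float       # fraction of this band's articles that are OA


def weighted_oa_rate(bands, weight_attr):
    """Average the per-band OA rates, weighted by the chosen share."""
    total_weight = sum(getattr(band, weight_attr) for band in bands)
    weighted_sum = sum(getattr(band, weight_attr) * band.oa_rate for band in bands)
    return weighted_sum / total_weight


# Hypothetical numbers for illustration only (NOT taken from the study).
BANDS = [
    AgeBand("older articles", corpus_share=0.8, reader_share=0.3, oa_rate=0.2),
    AgeBand("recent articles", corpus_share=0.2, reader_share=0.7, oa_rate=0.5),
]

corpus_rate = weighted_oa_rate(BANDS, "corpus_share")
reader_rate = weighted_oa_rate(BANDS, "reader_share")

print(f"OA share of the whole corpus:    {corpus_rate:.0%}")
print(f"OA share of what readers access: {reader_rate:.0%}")
```

Because readers concentrate on recent articles, and recent articles have the higher OA rate, the reader-weighted figure comes out above the whole-corpus figure even though the per-band rates are identical in both calculations.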


RP: You coin a new definition of open access in your paper, what you call 
Bronze OA. Can you say something about Bronze OA and its 
implications? It seems to me, for instance, that a lot of papers (over half?) 
currently available as open access are vulnerable to losing their OA 
status. Is that right? If so, what can be done to mitigate the problem? 


HP: Yes, we did think we were coining a new term. But this morning I 
learned we weren’t the first to use the term Bronze OA — that honour goes to 
Ged Ridgway, who used it in a tweet posted in 2014.


Our definition of Bronze OA is the same as Ged’s: articles made
free-to-read on the publisher’s website, without an explicit open license.
This includes Delayed OA and promotional material like newsworthy articles
that the publishers have chosen to make free but not open.


It also includes a surprising number of articles (perhaps as much as half of 
the Bronze total, based on a very preliminary sample) from entirely free-to- 
read journals that are not listed in DOAJ and do not publish content under 
an open license. Opinions will differ on whether these are properly called 
“Gold OA” journals/articles; in the paper, we suggest they might be called 
“Dark Gold” (because they are hard to find in OA indexes) or “Hidden 
Gold.” We are keen to see more research on this. 


More research is also needed to understand the other characteristics of 
Bronze OA. Is it disproportionately non-peer-reviewed content (e.g. front- 
matter), as seems likely? How much of Bronze OA is also Delayed OA? 
How much Bronze is Promotional, and how transient is the free-to-read 
status of this content? How many Bronze articles are published in “hidden 
gold” journals that are not listed in the DOAJ? Why are these journals not 
defining an explicit license for their content, and are there effective ways to
encourage them to do so? 


This kind of follow-up research is needed before we can understand the 
risks associated with Bronze and what kind of mitigation would be helpful. 


RP: You say in your paper, “About 7% of the literature (and 17% of the 
OA literature) is Green, and this number does not seem to be growing at 
the rate of Gold and Hybrid OA.” You also suspect that much of this green 
OA is “backfilling” repositories with older articles, which are generally 
viewed as being of less value. What happened to the OA dream articulated 
by Stevan Harnad in 1994, and what future do you predict for green OA 
going forward? 


HP: First, I should clarify: our definition of Green OA for the purposes of 
the study is that a paper is in a repository and is not available for free on the 
publisher site. This is so we don’t double count articles as both Green and 
Gold (or Hybrid or Bronze) for our analysis. 


We gave publisher-hosted locations the priority in our classifications 
because we suspect most people would rather read papers there. So, in our 
article when we say green OA isn’t growing, what we mean is that more 
recent papers that are only available in repositories are available as Green 
OA at roughly the same rate as older papers. 
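

The priority rule described above can be sketched as a small classifier. This is a simplified illustration, not the study's actual code: the Article fields are hypothetical, and the tests used here for Gold (journal listed in DOAJ) and Hybrid (open license in a journal not listed in DOAJ) are assumptions. Only the Bronze and Green rules follow the definitions given in this interview.

```python
from dataclasses import dataclass
from enum import Enum


class OAStatus(Enum):
    GOLD = "gold"
    HYBRID = "hybrid"
    BRONZE = "bronze"
    GREEN = "green"
    CLOSED = "closed"


@dataclass(frozen=True)
class Article:
    # All four fields are hypothetical inputs for this sketch.
    free_on_publisher_site: bool
    has_open_license: bool
    journal_in_doaj: bool
    in_repository: bool


def classify(article):
    """Assign exactly one status, checking publisher-hosted copies first."""
    if article.free_on_publisher_site:
        if article.journal_in_doaj:
            return OAStatus.GOLD      # assumption: DOAJ listing marks Gold
        if article.has_open_license:
            return OAStatus.HYBRID    # assumption: open license, non-DOAJ journal
        return OAStatus.BRONZE        # free to read, no explicit open license
    if article.in_repository:
        return OAStatus.GREEN         # repository copy only
    return OAStatus.CLOSED


# A "shadowed green" article: in a repository AND free on the publisher site.
shadowed = Article(free_on_publisher_site=True, has_open_license=False,
                   journal_in_doaj=False, in_repository=True)
# A repository-only article.
repo_only = Article(free_on_publisher_site=False, has_open_license=False,
                    journal_in_doaj=False, in_repository=True)

print(classify(shadowed).value)
print(classify(repo_only).value)
```

Checking the publisher site first is what keeps each article in a single category: a repository copy counts as Green only when no free publisher-hosted copy exists, so the shadowed article above is counted as Bronze rather than Green.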


It is worth future study to understand this better. I have a suspicion: perhaps 
much of what would have been Green OA became Bronze and what we call 
“shadowed green” — where there is a copy in a repository and a freely 
available copy on the publisher’s site as well. I suspect publishers 
responded to funder mandates that require self-archiving by making the 
paper free on the publisher sites as well, in synchronized timing. 


Specifically, Biomed doesn’t look like it has as much Green as I’d expect, 
given the success of the NIH mandate and the number of articles in PMC. 
We do know many biomed journals have Delayed OA policies, which we 
categorized as Bronze in our analysis. Did they implement these Delayed 
OA policies in response to the PMC mandates? Perhaps others already 
know this to be true... I haven’t had a chance to look it up. Anyway. I think 
the interplay between Green and Bronze is especially worth more 
exploration. 


We do also report on all the articles that are deposited in repositories, Green 
plus shadowed green, in the article’s Appendices. We found the proportion 
of the literature that is deposited in repositories to be higher for recent 
publication years. 


One final note: We actually changed the sentence that you quoted in the 
final version of our paper, because we were wrong to talk about “growing” 
as we did. Our study didn’t measure when articles were deposited in 
repositories, but just looked at their publication year. Other studies have 
demonstrated that people often upload papers from earlier years, a practice 
called backfilling. 


I suppose in some ways these have less value, because they are read less 
often. That said, anyone who really needs a particular paper and doesn’t 
otherwise have access to it is surely happy to find it. 


RP: You also looked at the so-called citation advantage and estimate that 
an OA article is likely to attract 18% more citations than average. The 
citation advantage is a controversial topic. I don’t want to appear too 
cynical, but is not the idea of trying to demonstrate a citation advantage
more an advocacy tool than a meaningful notion? I note, for instance, that
Academia.edu has claimed that posting papers to its network provides a 
73% citation advantage. Surely the real point here is that if all papers 
were open access there would be no advantage to open access from a 
citation point of view? 


HP: That’s true! And that’s the world I’d love to see — one where the 
citation playing field is flat, because everyone can read everything. 


RP: What would you say were the implications of your study for the 
research community, for librarians, for publishers and for open access 
policies? 


HP: For the research community: Install Unpaywall! You'll be able to read 
half the literature for free. Self-archive your papers, or publish OA. 


For OA/bibliometrics researchers: Build on our open data and code, let’s 
learn more about OA and where it’s going. 


For librarians: Use this data to negotiate with publishers: Half the literature
is free. Don’t pay full price for it. 


For publishers: Half the literature is now free to read. That percentage is 
growing. You don’t need a weathervane to know which way the wind 
blows: long term, there’s no money in selling things that people can get for 
free. Flip your journals. Sell services to authors, not access to content — it’s 
an increasingly smart business decision, as well as the Right Thing To Do. 


For open access policy makers: We need to understand more about Bronze. 
Bronze OA doesn’t safeguard a paper’s free-to-read status, and it isn’t 
licensed for reuse. This isn’t good enough for the noble and useful content 
that is Scholarly Research. Also: let’s accelerate the growth. 


You didn’t ask about tool developers. An increasing number of people are 
making tools that they can integrate OA into. They should use the oaDOI 
service. Now that such a large chunk of the literature is free, there are a lot 
of really transformative things we can build and do — in terms of knowledge 
extraction, indexing, search, recommendation, machine learning etc. 
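

For tool developers, a lookup against such a service might look like the sketch below. It rests on assumptions not stated in this interview: that the service is reachable as the Unpaywall REST API at the address shown, that it expects a DOI in the path and a contact email as a query parameter, and that the JSON reply carries `is_oa` and `best_oa_location` fields. The function names are hypothetical, and the details should be checked against the service's current documentation before use.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: base address of the lookup service (not given in the interview).
API_ROOT = "https://api.unpaywall.org/v2/"


def build_lookup_url(doi, email):
    """Build the request URL for one DOI; the email identifies the caller."""
    return f"{API_ROOT}{quote(doi, safe='/')}?{urlencode({'email': email})}"


def best_free_url(record):
    """Return the URL of the best free copy in a reply, or None if there is none."""
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    return location.get("url")


def lookup(doi, email):
    """Fetch and decode the record for one DOI (performs a network request)."""
    with urlopen(build_lookup_url(doi, email), timeout=10) as response:
        return json.load(response)


# Offline example using a hand-written reply in the assumed shape.
sample_reply = {
    "is_oa": True,
    "best_oa_location": {"url": "https://example.org/paper.pdf"},
}
print(best_free_url(sample_reply))
```

Keeping the URL building and reply parsing separate from the network call means the logic can be exercised without touching the live service.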


RP: OA was at the beginning as much (in fact more) about affordability 
as about access (certainly from the perspective of librarians). I note the 
recently published analysis of the RCUK open access policy reports that 
the average APC paid by RCUK rose by 14% between 2014 and 2016, and 
that the increase was greater for those publishers below the top 10 (who 
are presumably focused on catching up with their larger competitors). 
Likewise, the various flipping deals we are seeing emerge are focused on 
no more than transferring costs from subscriptions to APCs, with no 
realistic expectation of prices falling in the future. If the research 
community could not afford the subscription system (which OA advocates 
have always maintained) how can it afford open access in the long-term? 


HP: If the rising APCs are because small publishers are catching up with 
the leaders by raising prices, that won’t continue forever — they’ll catch up. 
Then it’ll work like other competitive marketplaces. 


The main issue is freeing up the money that is currently spent on 
subscriptions. We think studies like this, and tools like Unpaywall, can be 
helpful in lowering subscription rates, and forgoing Big Deals, as libraries
are increasingly doing. 


RP: As you say, in your study you ignored social networking sites like 
Academia.edu and ResearchGate “in accordance with an emerging 
consensus from the OA community, and based largely on concerns about 
long-term persistence and copyright compliance.” And you also say, “The 
growing proportion of OA, along with its increased availability using tools 
like oaDOI and Unpaywall, may make toll-access publishing increasingly 
unprofitable, and encourage publishers to flip to Gold OA models.” I am 
wondering, however, if it is not more likely that sites like Academia.edu 
(which researchers much prefer to use than paying to publish or 
depositing in their repository) and Sci-Hub (which is said to contain most 
of the scientific literature now) will be the trigger that will finally force 
legacy publishers to flip their journals to open access, whatever one’s 
views on the copyright issues? Would you agree?


HP: It won’t be any one trigger, but rather an increasingly inhospitable 
environment. Sci-Hub is a huge contributor to that, and Academic Social 
Networks are too. Unpaywall opens up another front: a best-practice, legal 
approach to bypassing paywalls that librarians and others can unabashedly 
recommend. It all combines to make it easier and more profitable for 
publishers to flip, and for the future to be OA. 


RP: Thank you for answering my questions. 